Data manupulation in R

Bolin Wu - KI

2023-01-24 (updated: 2023-01-24)

# run the install command if did not do it before
# install.packages('remotes')
# remotes::install_github("Bolin-Wu/workshopr", subdir = "rpackage", force = TRUE)

# load the package
library(workshopr)

library(tidyverse)
library(here)

Introduction

This document is share useful data manipulation skills at daily epidemiology work. My main goal is to follow the “dont’t repeat yourself” (DRY) principle. It can make our code more readable and reduce our chance of making mistakes.

The content includes:

  • transform data shape
  • get the label from DTA and SPSS in R
  • assign function
  • apply, lapply, sapply
  • filter variables based on name pattern
  • check missing values
  • bin variables by percentiles

Transform data shape

Transform data shape for many people, including me, sounds troublesome. Within the R world, its relevant functions are evolving overtime as well.

In the beginning, I used spread and gather. Every time I use spread and gather, it takes me a while to figure out how to fill in ‘key’ and ‘value’. But as you can see from their documentation, their ‘lifecycle’ is ‘superseded’.

Now I only use pivot_longer and pivot_wider for transforming data. You can find their comprehensive documentation here. They come with better documentation, more powerful application, and better integration with tidyverse syntax.

Example

Let’s assumme we received a wide format data:

head(fake_snack_df)

The column names are:

sort(colnames(fake_snack_df))
##  [1] "Date_wave1"     "Date_wave2"     "Date_wave3"     "Date_wave4"    
##  [5] "Date_wave5"     "Date_wave6"     "dementia_wave1" "dementia_wave2"
##  [9] "dementia_wave3" "dementia_wave4" "dementia_wave5" "dementia_wave6"
## [13] "education"      "Lopnr"          "sex"

Now, assume for some reason, e.g. merge it with other data set, we want to transform it in a long format. There are two variables should be formated: ‘Date’ and ‘dementia’. For beginners, I would recommend to start small.

  • Select the interested columns
fake_snack_df %>% 
  select(contains("Date"))
  • Read documentation, try to fill in the arguments. The 3 basic arguments are:
    • cols tells R what variables to pivot.
    • names_to defines a new name for columns in cols.
    • values_to gives a new name for values under the columns in cols.

Let’s give a first try:

# check pivot_longer() documentation in R: ?pivot_longer()
fake_snack_df %>% 
  select(Lopnr,contains("Date")) %>% 
  pivot_longer(cols = contains("Date"),
               names_to = "wave", names_prefix = "Date",
               values_to = "date")

The result above looks good, but ‘wave’ looks a bit strange. I will leave the task to audience to fix this column.

  • Do the same with ‘dementia’ columns
fake_snack_df %>% 
  select(Lopnr,contains("dementia")) %>% 
  pivot_longer(cols = contains("dementia"),
               names_to = "wave", names_prefix = "dementia",
               values_to = "dementia")

Task

  1. Change the arguments in the pivot_longer function to get proper wave column. Result example:
  2. Merge the two long pivot dataset together.
  3. Optional: Pivot the original ‘fake_snack_df’ by using only one pivot_longer function. Put wave and date after ‘Lopnr’ column. Hint: check the names_to argument and relocate() function. Result example: